Machine Translation Using Vector Space Representations

ABSTRACT

Disclosed herein are methods, articles of manufacture, and systems for translating text. Such a method includes generating a conceptual representation space based on a plurality of source-language documents and a plurality of target-language documents. The method also includes generating, in the conceptual representation space, respective representations of a new source-language document and each of a plurality of dictionaries. The method further includes selecting a first dictionary from the plurality of dictionaries responsive to a similarity between the representation of the new source-language document and the representation of the first dictionary. The method still further includes translating, by using the first dictionary, a term in the new source-language document into a target-language term.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 11/408,957 to Bradford, entitled “Machine Translation Using Vector Space Representations,” filed on Apr. 24, 2006, now allowed, which application claims benefit under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application 60/674,705, entitled “System And Method For Improved Machine Translation Using Vector Space Representation,” to Bradford, filed on Apr. 26, 2005. The entirety of each of the foregoing applications is hereby incorporated by reference as if fully set forth herein.

BACKGROUND

1. Field of the Invention

The present invention is generally directed to the field of machine translation.

2. Background

Translation of text from one human language into another is important in many commercial and governmental activities, as well as having personal applications. Translation of text by human translators is time-consuming and expensive. There is a substantial need for automated means of carrying out the translation function. Numerous approaches have been applied in software for automated machine translation. However, as will be described in more detail below, the quality of the output from contemporary machine translation systems is generally well short of desired performance.

Machine translation software converts text from one human language (the source-language) into another (the target-language). Despite 50 years of development, the capabilities of automated machine translation systems are still discouragingly limited, as discussed in Machine Translation: an Introductory Guide, NCC Blackwell, London, 1994, ISBN: 1855542-17X. Major approaches applied in machine translation are: (i) rule-based systems; (ii) example-based systems; and (iii) statistical machine translation.

Even for the simplest of language pairs (for example, English and Spanish), complex sentences and idiomatic expressions are often poorly handled. For more difficult language pairs (for example, English and Arabic), the meaning of sentences is often garbled. With the present state-of-the-art, the applicability of machine translation is limited.

A key problem in machine translation is the lack of fidelity with which translated text reflects the meaning and tone of source text. For example, machine translation systems have problems in several areas, including:

1. Word sense disambiguation. In human languages, many words have multiple meanings. For example, the English word “strike” has dozens of common meanings. Examples of poor machine translation typically involve an incorrect choice of word sense.

2. Idiomatic expressions. Better capabilities should be developed to deal with idiomatic expressions, such as “kicked the bucket” or “good as gold.”

3. Anaphora resolution. Machine translation systems have difficulties resolving ambiguous references.

4. Logical decomposition. Machine translation systems have difficulties decomposing long sentences into coherent textual elements, particularly for languages such as Arabic.

Therefore, what is needed is a system and method for improving the performance of machine translations. For example, the improvement should more effectively deal with word sense ambiguity, idiomatic expressions, anaphora resolution, and logical decomposition.

BRIEF SUMMARY

In accordance with the present invention there is provided a system and method for improving the performance of machine translations. A conceptual representation afforded by an abstract mathematical vector space (such as a Latent Semantic Indexing (LSI) space) addresses the machine translation problems by more effectively dealing with, inter alia, word sense ambiguity, idiomatic expressions, anaphora resolution, and logical decomposition.

An embodiment of the present invention provides a method for automatically translating text, including the following steps. First, a conceptual representation space is generated based on source-language documents and target-language documents, wherein respective terms from the source-language and target-language documents have a representation in the conceptual representation space. The conceptual representation space may be, for example, a Latent Semantic Indexing (LSI) space. Second, a new source-language document is represented in the conceptual representation space, wherein a subset of terms in the new source-language document is represented in the conceptual representation space, such that each term in the subset has a representation in the conceptual representation space. In an LSI-based example, the representation of each respective term may be a vector representation. Then, a term in the new source-language document is automatically translated into a corresponding target-language term based on a similarity between the representation of the term and the representation of the corresponding target-language term. In an LSI-based example, the similarity may be a cosine similarity between vector representations.

In an example, the above-mentioned embodiment may include a method for disambiguating words at a word-level, which includes the following additional steps. A disambiguated conceptual representation space is generated for at least one of the source-language documents. In the disambiguated conceptual representation space, a polysemous word contained in the at least one source-language document has a plurality of representations, wherein each representation of the polysemous word corresponds to a sense of that word. A representation of the new source-language document is then generated in the disambiguated conceptual representation space, wherein a subset of terms in the new source-language document is represented in the disambiguated conceptual representation space, such that each term in the subset has a representation in the disambiguated conceptual representation space. A term in the new source-language document is automatically translated into a corresponding target-language term based on a similarity between the representation of the term and the representation of one of the senses of the polysemous word.

Another embodiment of the present invention provides a method for automatically translating text based on a disambiguation of text at a dictionary-level, including the following steps. First, a conceptual representation space (such as an LSI space) is generated based on source-language documents and target-language documents. Second, a plurality of dictionaries is provided. Third, a representation of each dictionary is generated in the conceptual representation space. Fourth, a new source-language document is represented in the conceptual representation space. Fifth, a first dictionary is selected from the collection of dictionaries based on a similarity between the representation of the first dictionary and the representation of the new source-language document. Then, a term in the new source-language document is automatically translated into a corresponding target-language term based on the first dictionary.

A further embodiment of the present invention provides a method for producing a machine translation of a text passage based on a combination of a plurality of translations of the text passage, including the following steps. First, a conceptual representation space is generated based on a collection of source-language documents and a collection of target-language documents. Second, a plurality of translations of a text passage is provided. The plurality of translations may be received from a conventional translation algorithm, such as a rule-based algorithm, an example-based algorithm, or a statistical machine translation algorithm. Third, a representation of each translation is generated in the conceptual representation space. Then, the text passage is automatically translated based on similarity comparisons among the representations of the translations.

A further embodiment of the present invention provides a method for generating a parallel corpus of documents, including the following steps. First, a conceptual representation space is generated based on a collection of source-language documents and a collection of target-language documents. Each target-language document in the collection of target-language documents comprises a translation of a source-language document in the collection of source-language documents. Second, a new collection of documents is provided, including both source-language documents and target-language documents. Third, a representation of each document in the new collection of documents is generated in the conceptual representation space. Fourth, a collection of parallel documents is identified based on similarity comparisons among the representations in the conceptual representation space. Then, the collection of source-language documents and the collection of target-language documents are combined with the collection of parallel documents resulting in a combined collection of documents, and a new conceptual representation space is generated based on the combined collection of documents, wherein the new conceptual representation space is stored in an electronic format.

A further embodiment of the present invention provides a method for automatically translating text, including the following steps. First, a conceptual representation space is generated based on source-language documents and target-language documents, wherein respective terms from the source-language documents and the target-language documents have a representation in the conceptual representation space. Second, a similarity is measured between at least one pair of terms based on the representations of terms included in the at least one pair of terms, wherein the at least one pair of terms includes a term from at least one of the source-language documents and a term from at least one of the target-language documents. Third, the similarity is converted to an association probability. Then, the association probability is used as an estimate of a parameter in a statistical translation algorithm.

Techniques in accordance with embodiments of the present invention provide several advantages over other techniques, including the example advantages listed below.

1. A method in accordance with an embodiment of the present invention generates conceptual representation spaces that deal with character strings and thus are inherently independent of language. Hence, techniques in accordance with embodiments of the present invention can be applied to all combinations of source and target-languages, and are independent of genre and subject matter.

2. An embodiment of the present invention can be used for creating conceptual representation spaces that are generated from large collections of documents, thus capturing detail of languages in a manner much more efficient than human construction.

3. Since methods in accordance with embodiments of the present invention are based on machine learning principles, the conceptual representation spaces generated by these methods may be continuously and automatically updated with new data, thus keeping pace with changes in language.

4. A method in accordance with the embodiments of the present invention can deal directly with terms that are not actual words, such as abbreviations and acronyms.

Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in detail below with reference to the accompanying drawings. It is noted that the invention is not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.

BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES

The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the relevant art(s) to make and use the invention.

FIG. 1 illustrates formation of the term by document matrix used in an embodiment of the present invention.

FIG. 2 illustrates decomposition of the term by document matrix of an embodiment of the present invention into three constituent matrices.

FIG. 3 illustrates formation of the LSI matrix used in an embodiment of the present invention.

FIG. 4 illustrates the location of the training documents in the data object space for an example reduced to two dimensions in a dual language example.

FIG. 5 depicts a flowchart of a method for automatically translating text in accordance with an embodiment of the present invention.

FIG. 6 depicts a flowchart of a method for automatically accounting for word sense disambiguation at a dictionary level in accordance with an embodiment of the present invention.

FIG. 7 depicts a flowchart of a method for automatically accounting for word sense disambiguation at a word level in accordance with an embodiment of the present invention.

FIG. 8 depicts a flowchart of a method for automatically treating idiomatic expressions in machine translation systems in accordance with an embodiment of the present invention.

FIG. 9 is a block diagram of a computer system on which an embodiment of the present invention may be executed.

The features and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.

DETAILED DESCRIPTION

I. Introduction

As is described in more detail herein, according to an embodiment of the present invention there is provided a method and system for improving machine translation of text. A conceptual representation afforded by an abstract mathematical vector space addresses the machine translation problems by more effectively dealing with word sense ambiguity, idiomatic expressions, anaphora resolution, statistical machine translation, and logical decomposition.

It is noted that references in the specification to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

As used herein, a “term” shall mean any string of characters, including letters, numbers, symbols, and other similar characters. An example of a term can include, but is not limited to, a word, a collection of words, a word stem, a collection of word stems, a phrase, an acronym, an alphanumeric designator, an entity name, and similar strings of characters and combinations thereof. It is to be appreciated that the word “term” as used herein may refer to a string of characters in any human language, computer language, or other similar language comprised of strings of characters.

Embodiments of the present invention are described below in terms of a particular abstract mathematical space called a Latent Semantic Indexing (LSI) space. This is for illustrative purposes only, and not limitation. It will be apparent to a person skilled in the relevant art(s) from the description contained herein how to implement embodiments of the present invention in other abstract mathematical spaces.

II. Overview

Methods have been developed for generating vector space representations of language that demonstrably capture aspects of the conceptual content of text. For example, one of these techniques is called latent semantic indexing (LSI), an implementation of which is described below in Section III and in U.S. Pat. No. 4,839,853 (the '853 patent), entitled “Computer Information Retrieval Using Latent Semantic Structure” to Deerwester et al., the entirety of which is incorporated by reference herein.

The LSI technique can automatically process arbitrary collections of text and generate a high-dimensional vector space in which both text objects (generally documents) and terms are distributed in a fashion that reflects their meaning. An extension of this technique allows processing of phrases. Experiments have demonstrated a striking similarity between some aspects of the text processing in the LSI representation space and human processing of language, as discussed by Landauer, T., et al., in “Learning Human-Like Knowledge by Singular Value Decomposition: A Progress Report,” in M. I. Jordan, M. J. Kearns and S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, Cambridge: MIT Press, pp. 45-51 (1998), the entirety of which is incorporated by reference herein.

Other techniques have also been developed that generate high-dimensional vector space representations of text objects and their constituent terms, for example, as described in the following references: (i) Marchisio, G., and Liang, J., “Experiments in Trilingual Cross-language Information Retrieval,” Proceedings, 2001 Symposium on Document Image Understanding Technology, Columbia, Md., 2001, pp. 169-178; (ii) Hofmann, T., “Probabilistic Latent Semantic Indexing,” Proceedings of the 22nd Annual SIGIR Conference, Berkeley, Calif., 1999, pp. 50-57; (iii) Kohonen, T., “Self-Organizing Maps,” 3rd Edition, Springer-Verlag, Berlin, 2001; and (iv) Kolda, T., and O'Leary, D., “A Semidiscrete Matrix Decomposition for Latent Semantic Indexing Information Retrieval,” ACM Transactions on Information Systems, Volume 16, Issue 4 (October 1998), pp. 322-346. The entirety of each of these is incorporated by reference herein. In the present application, the conceptual representation spaces generated by LSI or any of the other foregoing techniques will be referred to generally as “conceptual representation spaces.”

An embodiment of the present invention is premised on the recognition that, at a fundamental level, properties of a conceptual representation space provide a mechanism for facilitating machine translation. In a conceptual representation space, terms that are similar in meaning have associated vector representations that are close together in the space. In an embodiment of the present invention, a conceptual representation space is generated based on source-language documents and target-language documents. An example method for generating such a conceptual representation space is described below in Section IV and in U.S. Pat. No. 5,301,109 (the '109 patent), entitled “Computerized Cross-language Document Retrieval Using Latent Semantic Indexing” to Landauer et al., the entirety of which is incorporated by reference herein. In such a space, terms in one language have vector representations that are close to the vector representations for terms of similar meaning in other language(s). An embodiment of the present invention exploits this fact to improve machine translation by: (1) creating a cross-lingual conceptual representation space for a source-language and a target-language; and (2) translating terms in a source text based on a similarity (such as a closeness) with terms of the target-language in the conceptual representation space.

The above-described method can be used on its own or as a supplement to source-language to target-language mappings generated via other means (such as from a bilingual dictionary). For example, an extension of the above-described method can be used to improve automatic machine translation of text while accounting for word sense disambiguation at the dictionary level or at the word level, as described in Sections VI and VII, respectively. Terms that are translated may include words, acronyms, abbreviations, and idiomatic expressions, as described in Section VIII. Alternative embodiments are described in Section IX, including anaphora resolution, logical decomposition, data fusion, statistical machine translation, and boot-strapping (to generate a parallel corpus of documents). Then, an example computer system is described in Section X, which computer system may be used to implement methods in accordance with embodiments of the present invention.

III. Overview of Latent Semantic Indexing

A. Introduction

Before discussing details of embodiments of the present invention, it is helpful to present a motivating example of LSI, which can also be found in U.S. Pat. No. 7,024,407, entitled “Word Sense Disambiguation” to Bradford, the entirety of which is incorporated by reference herein. This motivating example is used to present an overview of the LSI technique and how it may be used to generate a disambiguated LSI space and/or a cross-lingual conceptual representation space, as described in Section IV.

To generate an LSI vector space, the following pre-processing steps may be applied to the text. First, frequently-occurring words (such as “the,” “and,” “of,” and similar words) may be removed. Such frequently-occurring words, typically called “stop words,” have little contextual discrimination value. Second, certain combinations of terms (such as “United States”) may be treated as phrases. Third, hyphenated terms may be split into separate terms. Fourth, a process known as word stemming may be applied, in which a word is reduced to its root form. For example, the words “clears,” “cleared,” and “clearing” would all be reduced to the stem “clear.” The extent to which any or all of these four pre-processing steps are applied will depend on the application.
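By way of illustration only, the four pre-processing steps may be sketched as follows. The stop-word list, the phrase table, and the suffix-stripping rule are assumptions made for this sketch; in particular, `naive_stem` is a crude stand-in for a real stemming algorithm, and the order in which the steps are applied is a choice of the sketch rather than a requirement.

```python
import re

# Illustrative values only; none of these lists come from the specification.
STOP_WORDS = {"the", "and", "of", "a", "in", "to"}
PHRASES = {("united", "states"): "united_states"}
SUFFIXES = ("ing", "ed", "s")  # naive stand-in for a real stemmer


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply hyphen splitting, stop-word removal, phrase merging, and stemming."""
    # Treating every non-alphanumeric character as a separator splits hyphenated terms.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Merge known two-word combinations into single phrase terms.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # Stem every term that is not a merged phrase.
    return [t if "_" in t else naive_stem(t) for t in merged]


stems = [naive_stem(w) for w in ["clears", "cleared", "clearing"]]
tokens = preprocess("The United States and well-known acids")
```

In this sketch, `stems` holds the stem “clear” three times, matching the example given above.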

Although other vector space representations could be used in accordance with embodiments of the present invention, the technique of Latent Semantic Indexing (LSI) provides a vector space that is desirable in key respects. The LSI technique (including singular value decomposition and dimensionality reduction as described herein) provides a method, susceptible to a high degree of automation, for extracting semantic information that is latent in a collection of text. This technique can create a full index (that is, an LSI vector space) of a collection of documents without significant human intervention. The LSI technique is described in Deerwester, S., et al., “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, 41(6), pp. 391-407, October, 1990 and in U.S. Pat. No. 4,839,853 (the '853 patent). The entirety of each of these references is incorporated by reference herein. The optimality of this technique is shown in Ding, C., “A Similarity-based Probability Model for Latent Semantic Indexing,” Proceedings of the 22nd Annual SIGIR Conference, Berkeley, Calif., August, 1999. The LSI technique has been shown to be of considerable value as an approach to text retrieval.

The LSI technique starts with a collection of text passages, typically referred to in the literature as documents. The term document in this case may refer to paragraphs, pages, or other subdivisions of text and not necessarily to documents in the usual sense, i.e., externally defined logical subdivisions of text. For simplicity, this disclosure follows the standard convention of referring to the text passages of interest as documents. The disclosure uses term and word interchangeably as elements of documents.

The use of LSI is illustrated with reference to FIG. 1. As a first step, a large sparse matrix 10 is formed. The matrix 10 is typically referred to as a term-by-document matrix (T by D matrix, for short), which has a dimension m×n, where m is equal to the number of unique terms considered and n equals the number of documents considered. Each row (such as row 12) in the T by D matrix 10 corresponds to a term that appears in the documents of interest, and each column (such as column 14) corresponds to a document. Each element (i, j) in the matrix corresponds to the number of times that the term corresponding to row i occurs in the document corresponding to column j. For example, in FIG. 1, “able” appears one time in Doc #1 and “acid” appears two times in Doc #2.
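As a minimal sketch of forming such a matrix, the following example counts term occurrences per document. The two toy documents and the function name `term_by_document_matrix` are hypothetical; the documents are chosen only so that “able” occurs once in the first document and “acid” occurs twice in the second, and they are not the contents of FIG. 1.

```python
from collections import Counter


def term_by_document_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))  # one row per unique term
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = ["able acid", "acid acid base"]  # hypothetical toy documents
terms, Y = term_by_document_matrix(docs)
```

Here the matrix has m = 3 rows (terms) and n = 2 columns (documents). For real collections most entries are zero, so a sparse representation would normally be preferred over nested lists.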

Referring to FIG. 2, a known technique of singular value decomposition (SVD) can be used to decompose the T by D matrix to a product of three matrices, namely, a term matrix 30, a singular value matrix 32, and a document matrix 34. The singular value matrix 32 has non-zero values only on the diagonal. Small values on this diagonal, and their corresponding rows and columns in the term matrix 30 and document matrix 34, are then deleted. This truncation process is used to generate a vector space of reduced dimensionality, as illustrated in FIG. 3, by recombining the three truncated matrices into a (T by D)′ matrix. The relationships between the positions of terms and documents in this new vector space are subject to the same properties as in the original vector space.

B. General Model Details

It is now instructive to describe in somewhat more detail the mathematical model underlying the latent structure, singular value decomposition technique.

Any rectangular matrix $Y$ of $t$ rows and $d$ columns, for example, a $t$-by-$d$ matrix of terms and documents, can be decomposed into a product of three other matrices:

$$Y = T_0 S_0 D_0^{T} \qquad (1)$$

such that $T_0$ and $D_0$ have unit-length orthogonal columns (i.e., $T_0^{T} T_0 = I$; $D_0^{T} D_0 = I$) and $S_0$ is diagonal. This is called the singular value decomposition (SVD) of $Y$. A procedure for SVD is described in the text “Numerical Recipes,” by Press, Flannery, Teukolsky and Vetterling, 1986, Cambridge University Press, Cambridge, England, the entirety of which is incorporated by reference herein. $T_0$ and $D_0$ are the matrices of left and right singular vectors and $S_0$ is the diagonal matrix of singular values. By convention, the diagonal elements of $S_0$ are ordered in decreasing magnitude.

With SVD, it is possible to devise a simple strategy for an optimal approximation to $Y$ using smaller matrices. The $k$ largest singular values and their associated columns in $T_0$ and $D_0$ may be kept and the remaining entries set to zero. The product of the resulting matrices is a matrix $Y_R$ which is approximately equal to $Y$, and is of rank $k$. The new matrix $Y_R$ is the matrix of rank $k$ which is the closest in the least squares sense to $Y$. Since zeros were introduced into $S_0$, the representation of $S_0$ can be simplified by deleting the rows and columns having these zeros to obtain a new diagonal matrix $S$, and then deleting the corresponding columns of $T_0$ and $D_0$ to define new matrices $T$ and $D$, respectively. The result is a reduced model such that

$$Y_R = T S D^{T}. \qquad (2)$$

The value of $k$ is chosen for each application; it is generally such that $k \geq 100$ for collections of 1000-3000 data objects.
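The reduced model of equation (2) may be sketched with NumPy as shown below. The function name `truncated_svd` and the small matrix `Y` are assumptions of this sketch, and k = 2 is used only because the toy matrix is tiny; as noted above, practical values of k are much larger.

```python
import numpy as np


def truncated_svd(Y, k):
    """Return (T, S, D) such that Y_R = T @ S @ D.T is a rank-k approximation of Y."""
    # numpy returns the singular values already sorted in decreasing order.
    T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)
    T = T0[:, :k]         # first k left singular vectors (term matrix)
    S = np.diag(s0[:k])   # k largest singular values on the diagonal
    D = D0t[:k, :].T      # first k right singular vectors (document matrix)
    return T, S, D


# Hypothetical 4-term by 3-document count matrix.
Y = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 2.0]])
T, S, D = truncated_svd(Y, k=2)
Y_R = T @ S @ D.T
```

For this example, the least-squares (Frobenius) distance between `Y` and `Y_R` equals the single discarded singular value, which is the sense in which the rank-k approximation is closest to the original matrix.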

C. Example Similarity Comparisons

For discussion purposes, it is useful to interpret the SVD geometrically. The rows of the reduced matrices $T$ and $D$ may be taken as vectors representing the terms and documents, respectively, in a $k$-dimensional space. These vectors then enable the mathematical comparisons between the terms or documents represented in this space. Typical comparisons between two entities involve a dot product, cosine or other comparison between points or vectors in the space or as scaled by a function of the singular values of $S$. For example, if $d_1$ and $d_2$ respectively represent vectors of documents in the $D$ matrix, then the similarity between the two vectors (and, consequently, the similarity between the two documents) can be computed as any of: (i) $d_1 \cdot d_2$, a simple dot product; (ii) $(d_1 \cdot d_2)/(\|d_1\| \times \|d_2\|)$, a simple cosine; (iii) $(d_1 S) \cdot (d_2 S)$, a scaled dot product; and (iv) $((d_1 S) \cdot (d_2 S))/(\|d_1 S\| \times \|d_2 S\|)$, a scaled cosine.

Mathematically, the similarity between representations $d_1$ and $d_2$ can be represented as $\langle d_1 | d_2 \rangle$. Then, for example, if the simple cosine from item (ii) above is used to compute the similarity between two vectors, $\langle d_1 | d_2 \rangle$ can be represented in the following well-known manner:

$$\langle d_1 | d_2 \rangle = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = \frac{1}{\|d_1\| \, \|d_2\|} \left[ \sum_{i=1}^{k} d_{1,i} \, d_{2,i} \right], \qquad (3)$$

where $d_{1,i}$ and $d_{2,i}$ are the components of the representations $d_1$ and $d_2$, respectively.
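The four comparisons listed in items (i) through (iv) may be sketched as follows. The function names and the two-dimensional example vectors are assumptions chosen for illustration; `S` stands for the diagonal matrix of retained singular values.

```python
import numpy as np


def dot(d1, d2):
    """Item (i): simple dot product."""
    return float(np.dot(d1, d2))


def cosine(d1, d2):
    """Item (ii): simple cosine, as in equation (3)."""
    return float(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))


def scaled_dot(d1, d2, S):
    """Item (iii): dot product after scaling each vector by S."""
    return dot(d1 @ S, d2 @ S)


def scaled_cosine(d1, d2, S):
    """Item (iv): cosine after scaling each vector by S."""
    return cosine(d1 @ S, d2 @ S)


# Hypothetical document vectors in a k = 2 space and a hypothetical S.
d1 = np.array([1.0, 0.0])
d2 = np.array([1.0, 1.0])
S = np.diag([2.0, 1.0])

results = (dot(d1, d2), cosine(d1, d2),
           scaled_dot(d1, d2, S), scaled_cosine(d1, d2, S))
```

Scaling by `S` weights the dimensions associated with larger singular values more heavily, which is why the scaled cosine in this example is larger than the simple cosine.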

D. Folding In Documents

It is often useful to generate a representation of a document in the LSI space, even when that document is not used to generate the LSI space. The process of representing a document in an LSI space is often referred to as “folding” the document into the LSI space. The mathematical details for this process are the same whether the document is an existing document in the LSI space or a new document that is to be represented in the LSI space.

One criterion for such a derivation is that the insertion of a real document $Y_i$ should give $D_i$ when the model is ideal (i.e., $Y = Y_R$). With this constraint, for a document whose column of term counts is $Y_q$ and whose representation in the LSI space is $D_q$,

$$Y_q = T S D_q^{T}. \qquad (4)$$

Multiplying both sides of equation (4) by the matrix $T^{T}$ on the left, and noting that $T^{T} T$ equals the identity matrix, yields

$$T^{T} Y_q = S D_q^{T}.$$

Multiplying both sides of this equation by $S^{-1}$ on the left, transposing (noting that $S^{-1}$ is diagonal and therefore symmetric), and rearranging yields the following mathematical expression for folding in a document:

$$D_q = Y_q^{T} T S^{-1}. \qquad (5)$$

Thus, with appropriate rescaling of the axes, folding a document into the LSI space amounts to placing the vector representation of that document at the scaled vector sum of its corresponding term points.

As a prerequisite to folding a document into an LSI space, at least one of the terms in that document must already exist in the term space of the LSI space. The location of a new document that is folded into an LSI space (“the folded location”) will not necessarily be the same as the location of that document had it been used in the creation of the LSI space (“the ideal location”). However, the greater the overlap between the set of terms contained in that document and the set of terms included in the term space of the LSI space, the more closely the folded location of the document will approximate the ideal location of the document.
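Equation (5) may be sketched in a few lines of NumPy. The function name `fold_in_document`, the toy matrix, and the unseen document vector are assumptions of this sketch. As a sanity check, folding in a column of the original matrix reproduces the corresponding row of $D$.

```python
import numpy as np


def fold_in_document(y_q, T, S):
    """Return D_q = Y_q^T T S^{-1} for a length-t vector of term counts y_q."""
    # y_q is one-dimensional, so `y_q @ T` plays the role of Y_q^T T.
    return y_q @ T @ np.linalg.inv(S)


# Hypothetical 4-term by 3-document count matrix and its rank-2 LSI space.
Y = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 2.0]])
T0, s0, D0t = np.linalg.svd(Y, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

# Folding in the first training document recovers its existing representation.
refolded = fold_in_document(Y[:, 0], T, S)

# Folding in a document that was not used to build the space.
new_doc = fold_in_document(np.array([0.0, 1.0, 1.0, 0.0]), T, S)
```

Terms of a new document that are absent from the term space have no row in `T` and therefore contribute nothing to the folded location, which is consistent with the prerequisite stated above.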

E. Folding In Terms

Similar to documents, the process of representing a term in an LSI spaceis often referred to as “folding” the term into the LSI space. Themathematical details for this process are the same whether the term isan existing term in the LSI space or a new term that is to berepresented in the LSI space.

Folding a term into the LSI space is similar to folding a document intothe LSI space. The basic criterion is that the insertion of a real terminto Y_(i) should give T_(i) when the model is ideal (i.e., Y=Y_(R)).With this constraint,

Y_(q)=T_(q)SD^(T).  (6)

Multiplying both sides of equation (6) by the matrix D, and noting thatD^(T)D equals the identity matrix, yields

Y_(q)D=T_(q)S.  (7)

Multiplying both sides of equation (7) by S⁻¹ and rearranging yields thefollowing mathematical expression for folding in a term:

T_(q)=Y_(q)DS⁻¹.  (8)

Thus, with appropriate rescaling of the axes, perturbing an LSI space to fold a term into the LSI space amounts to placing the vector representation of that term at the scaled vector sum of its corresponding document points.

As a prerequisite to folding a term into an LSI space, at least one of the documents using that term must already exist in the document space of the LSI space. Similar to documents, the location of a new term that is folded into an LSI space (“the folded location”) will not necessarily be the same as the location of that term had it been used in the creation of the LSI space (“the ideal location”). However, the greater the number of documents in the LSI space that use that term, the more closely the folded location of the term will approximate the ideal location of the term.
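The basic criterion stated above, namely that folding in a real term of an ideal model returns its own row T_(i), can be checked numerically. The following Python sketch is offered as an illustration under assumed values: the matrices D and S, the row T_i, and the function name fold_in_term are all hypothetical.

```python
def fold_in_term(doc_counts, doc_matrix, singular_values):
    """Return T_q = Y_q D S^-1 as a list of k coordinates."""
    k = len(singular_values)
    # Y_q D: frequency-weighted sum of the document points (rows of D).
    summed = [
        sum(count * row[j] for count, row in zip(doc_counts, doc_matrix))
        for j in range(k)
    ]
    return [summed[j] / singular_values[j] for j in range(k)]


# Hypothetical ideal model with two documents and two dimensions.
D = [[0.6, 0.8], [0.8, -0.6]]   # columns are orthonormal, so D^T D = I
S = [2.0, 1.0]
T_i = [0.3, -0.4]               # an existing term's row of T

# Row of Y implied by equation (6): Y_i = T_i S D^T.
Y_i = [sum(T_i[k] * S[k] * doc[k] for k in range(2)) for doc in D]

# Folding that row back in via equation (8) should recover T_i.
recovered = fold_in_term(Y_i, D, S)
```

Up to floating-point rounding, recovered equals T_i, which is the behavior equations (6) through (8) require of an ideal model.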

IV. Multi-Language Case

To extend the principles of LSI to cross-language retrieval, a document set comprising all documents of interest, in the languages to be searched, is formed. A subset of the documents, called the “training set,” is selected; the “training set” is composed of documents for which translations exist in all the languages (two or more). The so-called “joint” term-by-document matrix of this set is composed from the addition of the terms in their renditions in all the languages. This joint matrix differs from the single-language LSI matrix in that each column, which represents a single multi-language document, is the combination of terms from the two (or more) languages coalesced into just a single column vector. As with the single-language technique, the joint matrix is then analyzed by singular value decomposition. The resulting representation defines vectors for the training-set terms and documents in the languages under consideration. Once the training analysis has been completed, other single-language documents can be “folded in” as pseudo-documents on the basis of terms from any one of the original languages alone. Most importantly, a user query is treated as such a new document.

In the derived indexing space there is a point representing each term in the training set. A new single-language document is assigned a point in the same space by putting it at an appropriate average of the locations of all the terms it contains. For cross-language retrieval, at least as many dimensions are kept as would be required to represent the collection in a single language. As outlined above, full or partial equivalence (in the sense that one term will have the same or similar effect in referencing documents as another) is induced between any two or more terms approximately to the extent that their pattern of use, or the overall pattern of association between other terms with which they co-occur, is similar across documents in the training set. Equivalent or nearly equivalent terms in different languages would, of course, be expected to be distributed in nearly the same way in a set of documents and their translations. Thus, the locations of two or more equivalent terms in different languages should be almost the same in the resulting representation. Consequently, a document folded in by terms in one language is retrieved by a query containing the appropriate set of words in another language.

A simple example may aid in understanding the general procedure. For this example, a training set of “documents” is composed of four titles, each of which is stated in both English and French.

Training Doc. T1. Effect of falling oil prices on small companies. Les consequences de la chute des prix du petrole pour les petites compagnies.

Training Doc. T2. Low oil prices—Effect on Calgary. La baisse des prix petroliers—Les consequences pour les citoyens de Calgary.

Training Doc. T3. Canadian nuclear power stations—Safety precautions. Les reacteurs nucleaires canadiens—Les precautions prises pour en assurer la securite.

Training Doc. T4. Safety standards for nuclear power plants—Swedish call for international conference. Les normes de securite en matiere de centrales nucleaires—L'appel de la Suede en faveur d'une conference internationale.

First, the joint training matrix of 55 terms (20 English-only, 32 French-only, and 3 shared by both languages) by four documents formed from these “documents” is constructed, as partially depicted in Table 1; this table shows the first six English-only words, the three words shared by both languages, and the last three French-only words. It is this joint matrix that will be decomposed by SVD.

TABLE 1

                  DOCUMENTS
TERMS             T1(e1, f1)   T2(e2, f2)   T3(e3, f3)   T4(e4, f4)
effect            1            1            0            0
of                1            0            0            0
falling           1            0            0            0
oil               1            1            0            0
prices            1            1            0            0
on                1            1            0            0
Calgary           0            2            0            0
precautions       0            0            2            0
conference        0            0            0            2
d                 0            0            0            1
une               0            0            0            1
internationale    0            0            0            1

As is apparent from the joint term-by-document training matrix of Table 1, each document is composed of all the terms in both French and English, i.e. the addition of terms from each document including its translation(s). For instance, since the term “precautions” appears as the same term in both the English and French versions, there is an entry of “2” under title T3 in the “precautions” row. As suggested by the foregoing illustrative example, the general procedure for formulating the joint term-by-document matrix for the multi-language case is as follows:

(1) for each document in the training set written in an original language, translate this document into all the other languages. (In the above example, each of the four training documents is in English, which is considered the original language, and each is translated to one other language, namely, French);

(2) each original document plus all of the other translations of each original document are parsed to extract distinct terms composing the multi-language documents. These terms define a database designated the lexicon database, and this database is stored in a memory of a computer. The lexicon database is used in constructing the general joint term-by-document matrix as presented below. (In the above example, the first document contains eight (8) distinct English terms and twelve (12) distinct French terms, “les” being repeated; the second document contains only two (2) more distinct English terms not contained in the first English document, namely, “low” and “Calgary”. The terms “oil”, “prices”, “effect”, and “on” are already in the lexicon database as a result of parsing the first English document. Continued parsing in this manner results in the fifty-five (55) distinct terms presented above, namely, 20 English-only, 32 French-only and 3 terms common to both languages.)

(3) the distinct terms from the lexicon database are then treated as being listed in a column, such as the TERMS column in Table 1, as an aid in preparing the joint term-by-document matrix; this column contains t rows. Each training document, composed of both the original as well as all translations, is assigned one column in the joint matrix; if there are d training documents, then there are d columns. Any (i,j) cell in the joint term-by-document matrix, that is, the intersection of the i^(th) “term” row with the j^(th) “document” column, contains a tabulation of the frequency of occurrence of the term in the i^(th) row with the document assigned to the j^(th) column. (In the example, training document T2 is shown to have a tabulation of 1 in the row with the term “effect” since it appears only once in the coalesced or merged English and French versions of the document. In contrast, there is an entry of 2 in the row with the term “Calgary” since it appears twice in the documents of T2, namely, once in the English document and once in the French document.)
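Steps (1) through (3) can be sketched in Python using the four training titles of the example. The tokenizer shown (lowercasing and splitting on every non-letter, so that “d'une” yields “d” and “une”) is an assumption made for this sketch; the description does not specify a parser. Under that assumption the sketch reproduces the term counts stated above and the Table 1 entries.

```python
import re

# The four training titles from the example, (English, French) per document.
# Dashes in the titles are punctuation only and are ignored by the tokenizer.
TRAINING = {
    "T1": ("Effect of falling oil prices on small companies.",
           "Les consequences de la chute des prix du petrole pour les "
           "petites compagnies."),
    "T2": ("Low oil prices - Effect on Calgary.",
           "La baisse des prix petroliers - Les consequences pour les "
           "citoyens de Calgary."),
    "T3": ("Canadian nuclear power stations - Safety precautions.",
           "Les reacteurs nucleaires canadiens - Les precautions prises "
           "pour en assurer la securite."),
    "T4": ("Safety standards for nuclear power plants - Swedish call for "
           "international conference.",
           "Les normes de securite en matiere de centrales nucleaires - "
           "L'appel de la Suede en faveur d'une conference internationale."),
}


def tokenize(text):
    """Assumed tokenizer: lowercase, then keep maximal runs of a-z."""
    return re.findall(r"[a-z]+", text.lower())


# Step (2): parse every version of every document into the lexicon.
english = set()
french = set()
for en_text, fr_text in TRAINING.values():
    english.update(tokenize(en_text))
    french.update(tokenize(fr_text))

lexicon = sorted(english | french)   # the t rows
doc_ids = list(TRAINING)             # the d columns

# Step (3): cell (i, j) counts term i in the merged text of document j.
joint = {term: [0] * len(doc_ids) for term in lexicon}
for j, doc_id in enumerate(doc_ids):
    en_text, fr_text = TRAINING[doc_id]
    for term in tokenize(en_text) + tokenize(fr_text):
        joint[term][j] += 1

print(len(lexicon), len(english - french), len(french - english),
      len(english & french))                 # 55 20 32 3
print(joint["calgary"], joint["effect"])     # [0, 2, 0, 0] [1, 1, 0, 0]
```

Because the tokenizer lowercases its input, the row shown as “Calgary” in Table 1 is keyed as “calgary” here.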

It is important to understand that it is not necessary to use all available documents to compose the training set. One useful test for the number of documents to include in the training set is the satisfactory retrieval of a document written in one language as determined by inputting the terms of the document as a query in another language. One illustrative test for the sufficiency of the training set will be presented below after the joint term-by-document matrix is decomposed. Also, it is important to realize that some retrieval situations will not require assigning all terms obtained during the parsing step to the lexicon database. A test of what terms to assign to the database is again the satisfactory retrieval of a document written in one language as determined by inputting the terms of the document as a query in another language.

By way of terminology, the generalization of a “document” is called a “data object,” to include applications such as graphics-type information as well as text. Moreover, the coalesced version of all translations of a data object as well as the original data object is called a merged data object.

The results of the decomposition are shown in Table 2, Table 3, and Table 4 for two dimensions.

TABLE 2

TERM MATRIX (55 terms by 2 dimensions)
effect            0.0039   −0.1962
of                0.0042   −0.2550
falling           0.0042   −0.2550
oil               0.0039   −0.1962
prices            0.0039   −0.1962
on                0.0039   −0.1962
Calgary           0.0056   −0.2178
precautions       0.0451   −0.0036
conference        0.3299    0.0124
d                 0.2081    0.0078
une               0.2081    0.0078
internationale    0.2081    0.0078

TABLE 3

DOCUMENT MATRIX (4 documents by 2 dimensions)
T1    0.0200   −0.8799
T2    0.0169   −0.4743
T3    0.1355   −0.0079
T4    0.9904    0.0269

TABLE 4

DIAGONAL (2 singular values)
3.2986   2.3920

FIG. 4 shows the location of the four training documents in this space. Since the angle of the coordinates representative of each document is the important parameter for search purposes and the absolute magnitude of the coordinates of each document is relatively unimportant for search purposes, the magnitude of each document has been normalized to unit magnitude for clarity of presentation.

Next, all single-language documents are folded into the space derived from the training set. Each remaining document is folded into the resulting space separately in its English and French versions, i.e. using only English terms and then only French terms in the pseudo-document representation of equation (5): for instance,

New Doc Ne. Ontario—Premier's rejection of further nuclear power plants. (Absolute coordinates of 0.0695, −0.0708)

New Doc Nf. L'ontario—le refus du premier ministre de favoriser la construction d'autres centrales nucleaires. (Absolute coordinates of 0.1533, −0.0775)

As shown, the English-only and French-only versions, Ne and Nf, end up close (“similar”) to one another and well separated from the other text items in the space. In fact, for a search angle of approximately plus/minus 26 degrees (cosine of 0.90), each document falls within the angle of similarity of the other document. The degree of similarity or closeness of corresponding documents folded into the semantic space after training is used as a test for the sufficiency of the set of data objects selected to train the semantic space. For instance, after training, if a set of documents like Ne and Nf does not fall within a pre-selected angle of similarity, then it may be necessary to re-train the semantic space in order to meet the prescribed retrieval criterion or criteria; for the illustrative case, the single criterion is falling within the angle of search. Typically, paragraphs of 50 words or more from 500 or more multi-language documents are suitable to train the semantic space.
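The sufficiency test can be checked directly against the coordinates given above for Ne and Nf. The following Python sketch computes the cosine of the angle between the two folded-in documents and compares it with the 0.90 threshold; the function and variable names are chosen for this illustration only.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


NE = (0.0695, -0.0708)   # English-only version, coordinates from the text
NF = (0.1533, -0.0775)   # French-only version, coordinates from the text

similarity = cosine(NE, NF)
angle_deg = math.degrees(math.acos(similarity))

# Criterion from the text: within about 26 degrees (cosine of 0.90).
within_search_angle = similarity >= 0.90
```

With these coordinates the cosine is roughly 0.95, an angle of a little under 19 degrees, so the pair satisfies the illustrative criterion.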

V. An Example Embodiment

Referring to FIG. 5, there is depicted a flowchart of a method 500 for automatically translating text in accordance with an embodiment of the present invention. Method 500 begins at a step 520 in which a collection of parallel documents is provided, that is, a collection of documents in which each document has a source-language version and a target-language version. Each target-language document (such as an English document) is a translation of a source-language document (such as an Arabic document). In addition, the source-language documents can be translated into more than one target language.

In a step 530, a conceptual representation space is generated based on terms in the parallel collection of documents. For example, the conceptual representation space may be generated in accordance with the LSI technique, an implementation of which is described above and in commonly-owned U.S. Pat. No. 4,839,853 entitled “Computer Information Retrieval Using Latent Semantic Structure” to Deerwester et al., the entirety of which is incorporated by reference herein. Additionally or alternatively, the conceptual representation space may be generated in accordance with a multi-lingual method as described above and in U.S. Pat. No. 5,301,109, entitled “Computerized Cross-language Document Retrieval Using Latent Semantic Indexing.” Step 530 can also be performed using alternative techniques or combinations thereof for generating a conceptual representation space.

In a step 540, a representation of a new source-language document (that is, a document that is to be translated) is generated in the conceptual representation space. For example, the new source-language document may be folded into the conceptual representation space as described above. The source-language document is parsed to determine the terms in the document. Generating the conceptual representation space (step 530) will provide many of the terms with a vector representation in that space. However, new terms may be present in the source-language document that do not already have a vector representation in the conceptual representation space. A vector representation can be established for these new terms. A potential meaning of each new term can be inferred from the term's vector representation in the conceptual representation space.

In a step 550, a term in the source-language document of step 540 is automatically translated into a corresponding term in a target-language document based on a similarity between the representation of the term in the source-language document and the representation of the corresponding target-language term. The similarity can be measured using any similarity metric defined on the conceptual representation space. Examples of similarity metrics include, but are not limited to, a cosine measure, a dot product, an inner product, a Euclidean distance measure, or some other similarity measure as would be apparent to a person skilled in the relevant art(s). Step 550 can be repeated for each term in the source-language document that is to be translated.
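One possible reading of step 550, using the cosine measure, is sketched below in Python. The two-dimensional coordinates, the example terms, and the function name translate_term are hypothetical and serve only to show the nearest-neighbor selection; an actual space would have many more dimensions and terms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translate_term(source_vector, target_vectors):
    """Return the target-language term whose vector is most similar."""
    return max(target_vectors,
               key=lambda term: cosine(source_vector, target_vectors[term]))


# Hypothetical coordinates in a shared two-dimensional space.
source_vectors = {"petrole": (0.10, -0.80)}
target_vectors = {
    "oil": (0.12, -0.78),
    "safety": (0.70, 0.05),
    "conference": (0.90, 0.10),
}

best = translate_term(source_vectors["petrole"], target_vectors)
print(best)  # oil
```

Any of the other similarity metrics listed above could be substituted for cosine by replacing the key function.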

VI. Word Sense Disambiguation at a Dictionary Level

As will be described in more detail below, an embodiment of the present invention can address word sense disambiguation at a dictionary level.

One of the primary sources of error in machine translation is the selection of a wrong sense of a word. For example, in an article about military operations, the word “tank” can have a different meaning than the same word in an article about automobiles. Some commercial machine translation systems allow a user to choose different dictionaries depending on the subject matter being translated. These dictionaries provide the most common translations of words within the context of their specified subject matter. Thus, while a general-purpose dictionary might provide a translation of “tank” using the sense of container for liquid, a military-specific dictionary would likely provide the sense of armored vehicle. Use of such dictionaries can considerably improve the quality of machine translation. However, there are two key drawbacks of existing implementations.

First, users must manually choose the dictionary to be used. This has drawbacks in terms of cost and time, and may present a problem for a user who has no knowledge of the source-language and may not have any a priori knowledge of the subject matter of an item to be translated. Second, a single dictionary is applied to a complete document or set of documents. This is a significant problem, as many documents treat more than one topic.

FIG. 6 depicts a flowchart 600 of a method in accordance with an embodiment of the present invention that uses characteristics of a conceptual representation space to overcome the above-described limitations. As shown in FIG. 6, the method of flowchart 600 begins at a step 610 in which a set of source-language documents is assembled. The assembled documents cover relevant concepts and vocabulary for material that is to be translated.

In a step 620, a conceptual representation space is generated from these documents. For example, the conceptual representation space may be generated in accordance with the LSI technique described above and in detail in commonly-owned U.S. Pat. No. 4,839,853 or by any other known technique for generating a conceptual representation space.

In a step 630, vectors in the conceptual representation space are created that are representative of the dictionaries available to be used in the machine translation process. Step 630 can be implemented using several different methods. For example, in accordance with a first method, for each available dictionary, many or all of the source-language terms and phrases in that dictionary are concatenated into a single text object and a corresponding vector representation is created in a manner consistent with the type of conceptual representation space. For example, in an LSI space, the corresponding vector representation can be created by applying a pseudo-query technique as described in commonly-owned U.S. Pat. No. 4,839,853.

In accordance with a second method for implementing step 630, for each available dictionary, source-language terms and phrases in the dictionary are rationally partitioned. Next, corresponding text objects are created from the source-language terms and phrases in each partition. Then, a corresponding vector representation is created for each text object in a manner consistent with the conceptual representation space.

In accordance with a third method for implementing step 630, translations of some or all of the target-language text from the dictionaries can be used to augment the first or second methods described immediately above.

In accordance with a fourth method for implementing step 630, a cross-lingual conceptual representation space is created. For example, methods analogous to those described in commonly-owned U.S. Pat. No. 5,301,109 may be used to create a cross-lingual conceptual representation space. Then, one or more vector representations are created for each dictionary based on some combination of the source and target-language text contained in the dictionaries. For example, text from a source-language dictionary may be concatenated with text from a target-language dictionary, and a vector representation can be generated for this concatenation.

In a step 640, for each document to be translated, a vector representation for that document is created using an appropriate approach for the particular conceptual representation space being employed.

In a step 650, during translation, the dictionary that is most conceptually similar to a document to be translated is applied. For most conceptual representation spaces, the most conceptually similar dictionary can be determined by finding the closest dictionary-related vector to the document vector. In an LSI space, for example, “closeness” can be determined by a cosine measure, or some other similarity measure, defined in the space.
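Steps 630 through 650 can be sketched in Python as follows. The sketch assumes that each dictionary is represented by one or more vectors (one per partition, as in the second method for step 630) and that the dictionary owning the single closest vector is selected; the dictionary names, the three-dimensional coordinates, and the function name select_dictionary are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_dictionary(document_vector, dictionary_vectors):
    """Return the name of the dictionary with the closest vector.

    dictionary_vectors maps a dictionary name to a list of vectors,
    one per partition of that dictionary.
    """
    def best_score(name):
        return max(cosine(document_vector, vector)
                   for vector in dictionary_vectors[name])

    return max(dictionary_vectors, key=best_score)


# Hypothetical dictionary vectors (step 630) and document vector (step 640).
dictionary_vectors = {
    "general": [(0.2, 0.9, 0.1)],
    "military": [(0.9, 0.1, 0.2), (0.7, 0.3, 0.6)],
    "automotive": [(0.1, 0.2, 0.9)],
}
document_vector = (0.8, 0.2, 0.3)

chosen = select_dictionary(document_vector, dictionary_vectors)
print(chosen)  # military
```

The same function can be applied to each conceptually coherent portion of a document in turn when a document treats more than one topic.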

Auxiliary structures other than dictionaries (such as synonym lists, lists of expansions for acronyms and abbreviations, lists of idiom substitutions, etc.) may be treated in a manner analogous to that described above with reference to flowchart 600. Auxiliary structures for multiple languages can be represented in a single conceptual representation space, thus allowing the above technique to be applied to more than one source-language and/or more than one target-language in a single conceptual representation space.

The method depicted by flowchart 600 does not require a user to manually choose a dictionary to be used. In addition, an extension of the method of flowchart 600 can be used so that a single dictionary is applied to conceptually coherent portions of documents to be translated. For example, commonly-owned U.S. patent application Ser. No. 11/316,837 entitled “Automatic Linear Text Segmentation” to Price, filed Dec. 27, 2005, which corresponds to U.S. Published Patent Application No. 2006/0224584 (the entirety of which is incorporated by reference herein), describes a method for automatically decomposing documents into conceptually cohesive portions. For documents treating more than one topic, this method can be used to identify conceptually coherent portions of the documents. Then, the method of flowchart 600 can be sequentially applied to each of the conceptually coherent portions.

VII. Word Sense Disambiguation at a Word Level

As will be described in more detail below, a method in accordance with an embodiment of the present invention addresses word sense disambiguation at a word level.

Choosing an appropriate dictionary to be used in a machine translation process, as described above, can have a beneficial effect on the quality of the translations produced. However, improvement in translation quality can also be obtained through word sense disambiguation at the word level. An embodiment of the present invention provides a method that uses automated word sense disambiguation in a conceptual representation space to improve machine translation. For example, the automated word sense disambiguation can be achieved by employing a method described in commonly-owned U.S. Pat. No. 7,024,407, entitled “Word Sense Disambiguation” to Bradford, the entirety of which is incorporated by reference herein.

FIG. 7 depicts a flowchart 700 of a method for disambiguating word sense at a word level. As shown in FIG. 7, flowchart 700 begins at a step 710 in which a set of source-language documents is assembled. The assembled documents cover relevant concepts and vocabulary for material that is to be translated.

In a step 720, a conceptual representation space is generated from these documents using a technique consistent with the type of conceptual representation space. For example, the conceptual representation space can be an LSI space as described above. In this example, the technique for generating the conceptual representation space would be similar to a technique described in U.S. Pat. No. 4,839,853.

In a step 730, a disambiguated version of the conceptual representation space is generated. Commonly-owned U.S. Pat. No. 7,024,407, entitled “Word Sense Disambiguation” to Bradford, describes methods for generating disambiguated versions of a conceptual representation space.

In a step 740, for each document to be translated, a vector representation for that document is created in the disambiguated conceptual representation space. This vector representation can be created in a manner consistent with the particular type of conceptual representation space generated in step 720. For example, in an LSI space, this vector representation can be created by application of a pseudo-query technique described in detail in U.S. Pat. No. 4,839,853. In the disambiguated conceptual representation space, this could require iterated disambiguation. That is, a first estimate of the vector representation for the document can be generated based on vector combination of either: (i) the vectors representing the most common senses of polysemous words it contains; or (ii) the vectors representing the averages of the word senses for the polysemous words (that is, the vectors generated in creating the initial conceptual representation space, prior to disambiguation).

Based on the initial vector representation, vectors representing the closest word senses are then chosen for each polysemous word in the document. (For these purposes, a word is polysemous if there is more than one vector representation generated for that word in the disambiguation process of step 730.) A new estimate of the vector representation for the document is generated by vector combination (such as vector addition, vector averaging, or the like) using these vector representations. This process may have to be repeated until either there is no more change in the calculated vector representation or the changes in that vector are below a threshold.
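The iterated disambiguation of step 740 can be sketched in Python as follows. The sketch starts from option (ii), the average of each word's sense vectors, uses vector averaging as the combination, and stops when the chosen senses no longer change (so the document vector no longer changes). The sense labels, coordinates, and function names are hypothetical, and the sense vectors themselves are assumed to have been produced already by a disambiguation process such as that of step 730.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n
                 for i in range(len(vectors[0])))


def closest_sense(senses, doc_vector):
    """Label of the sense vector nearest to the document vector."""
    return max(senses, key=lambda label: cosine(doc_vector, senses[label]))


def disambiguate(words, sense_vectors, max_iterations=20):
    """Return (chosen sense per word, final document vector)."""
    # First estimate: average of all sense vectors of each word.
    doc_vector = average(
        [average(list(sense_vectors[w].values())) for w in words])
    chosen = {}
    for _ in range(max_iterations):
        new_chosen = {w: closest_sense(sense_vectors[w], doc_vector)
                      for w in words}
        if new_chosen == chosen:   # no change in senses, hence in the vector
            break
        chosen = new_chosen
        doc_vector = average([sense_vectors[w][chosen[w]] for w in words])
    return chosen, doc_vector


# Hypothetical sense vectors; only "tank" is polysemous here.
sense_vectors = {
    "tank": {"vehicle": (0.9, 0.1), "container": (0.1, 0.9)},
    "armor": {"only": (0.8, 0.2)},
    "battalion": {"only": (1.0, 0.0)},
}
senses, doc_vector = disambiguate(["tank", "armor", "battalion"],
                                  sense_vectors)
print(senses["tank"])  # vehicle
```

A threshold on the change in the document vector could replace the equality test where sense choices oscillate.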

In a step 750, during translation, the indicated sense of each word or phrase is used in the translation of the polysemous word or phrase in the source document. For example, in applying a bilingual dictionary, the dictionary meaning corresponding to the sense indicated in the conceptual representation space can be used. If the senses are labeled according to a tagging method, step 750 may be implemented by comparing tags and labeled senses in the dictionary. For example, the tagging method can be similar to that described in commonly-owned U.S. Pat. No. 7,024,407, entitled “Word Sense Disambiguation” to Bradford. If a tagging method is not used, step 750 may be implemented by comparing positions in the conceptual representation space of word sense vectors and dictionary entries.

It is to be appreciated that dictionary entries for more than one language can be represented in a single conceptual representation space, allowing this technique to be applied to multiple target-languages using a single conceptual representation space.

VIII. Idiomatic Expressions

A difficult problem for machine translation algorithms is the occurrence of idiomatic expressions, such as “raining cats and dogs.” Many idiomatic expressions have a standard format, such as “good as gold,” or a small number of possible format variants, such as “hold (his/her/one's) horses.” A standard feature of conceptual representation spaces is that terms that are similar in meaning (such as car and automobile) are located close to each other in the conceptual representation space. Phrases can be treated as units in the creation of conceptual representation spaces, as described, for example, in commonly-owned U.S. Pat. No. 7,113,943, entitled “Method for Document Comparison and Selection” to Bradford (Publication No. 2002/0103799 A1), the entirety of which is incorporated by reference herein. In the resulting conceptual representation space, the vector representation for the phrase will be located near words that have meanings similar to the phrase. For example, the representation vectors for LAN and “local area network” will be very close together in a conceptual representation space containing that term and that phrase (provided “local area network” is indexed as a phrase).

Multilingual conceptual representation spaces can be generated using a method as described above and in U.S. Pat. No. 5,301,109. In such multilingual spaces, terms and phrases in one language have vector representations that are close to the vector representations for terms and phrases in the other language that are similar in meaning. This fact provides the basis for an embodiment of the present invention that enables treatment of idiomatic expressions in machine translation systems.

FIG. 8 depicts a flowchart 800 of a method for treating idiomatic expressions in a machine translation system in accordance with an embodiment of the present invention. As shown in FIG. 8, the method of flowchart 800 begins at a step 810 in which a cross-lingual conceptual representation space is created for a source and target-language(s). The cross-lingual conceptual representation space may be created, for example, in accordance with a method as described above and in U.S. Pat. No. 5,301,109.

In a step 820, idiomatic expressions are identified in the source-language and treated as phrases. This may be achieved in a number of ways. For example, a list of idiomatic expressions may be available for the language of interest (for instance, a list of idiomatic expressions used in English is provided by “The American Heritage Dictionary of Idioms,” Houghton Mifflin Company, Boston, 1997). In that case, the list of idiomatic formats is used to determine sequences of words that will be treated as phrases during the pre-processing stage of creating the conceptual representation space. Alternatively, an automated mechanism for identification of idiomatic expressions can be employed. For example, the idiomatic expressions can be automatically identified in the following manner.

First, through statistical analysis of a significant body of source-language material, sequences of words that appear more often than a threshold are identified. For example, the threshold can be heuristically determined. These sequences constitute candidate idiomatic expressions. In general, the number of words in an idiom will be limited. For example, in English, many idiomatic expressions do not exceed five words.

Second, these candidate idiomatic expressions can be iteratively treated as phrases in creating a conceptual representation space (more than one at a time may be treated in a single iteration).

Third, the vector representation for the candidate idiom is compared with the vector representation created by combining the vector representations for the constituent words of that candidate idiom. The combination is carried out in accordance with the standard method applicable to the particular type of conceptual representation space being used. For example, in the case of an LSI space, this can be a weighted average of the constituent vectors, as calculated for a pseudo-object as described in detail in U.S. Pat. No. 4,839,853.

Fourth, if the vector representation for the candidate idiom differs from that of the combined individual words of the candidate by more than a heuristically-determined amount, the candidate is treated as an idiom in further conceptual representation space processing.
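The four steps above can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than part of the described method: candidates are word sequences of two to five words counted in a single token list, the constituent vectors are combined by an unweighted mean instead of the weighted average mentioned for an LSI space, and “differs by more than a heuristically-determined amount” is rendered as a cosine below a fixed cut-off. All vectors and names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_phrases(tokens, threshold=1, max_len=5):
    """First step: word sequences that occur more often than threshold."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(2, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    return [phrase for phrase, count in counts.items() if count > threshold]


def is_idiom(phrase_vector, word_vectors, min_cosine=0.8):
    """Third and fourth steps: compare phrase vector with combined words."""
    n = len(word_vectors)
    combined = [sum(v[i] for v in word_vectors) / n
                for i in range(len(phrase_vector))]
    return cosine(phrase_vector, combined) < min_cosine


tokens = ("it was raining cats and dogs so we stayed in "
          "while it kept raining cats and dogs").split()
candidates = candidate_phrases(tokens)

# Hypothetical vectors obtained after indexing the candidates as phrases
# (second step). The idiom sits far from its constituent words.
idiom_flag = is_idiom(
    (0.9, 0.1),
    [(0.8, 0.3), (0.05, 0.95), (0.5, 0.5), (0.1, 0.9)],
)
# A compositional phrase sits where its words would place it.
plain_flag = is_idiom((0.6, 0.4), [(0.7, 0.3), (0.5, 0.5)])
print(idiom_flag, plain_flag)  # True False
```

In practice both the frequency threshold and the cosine cut-off would be tuned heuristically on the corpus at hand.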

Referring back to flowchart 800, in a step 830, during the translation process, idiomatic expressions are identified in the source text through comparison to the list of idiomatic expressions generated in step 820. As such idiomatic expressions are encountered, a similarity metric (such as proximity) in the conceptual representation space is used to identify likely translations of the idiom into the target-language(s). For example, these likely translations can be words or phrases from the target-language that are close to the vector representation for the source-language idiom in the conceptual representation space.

The effectiveness of the method illustrated by flowchart 800 can be improved by processing idiomatic expressions as described above for both the source-language and the target-language in the same cross-lingual conceptual representation space. Note that the approach described above can be applied to both multiple source-languages and multiple target-languages in a single conceptual representation space.

IX. Alternative Embodiments

A. Anaphora Resolution

As demonstrated in the literature (Klebanov, B., and Wiemer-Hastings, P., 2002, “Using LSA for Pronominal Anaphora Resolution,” in Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing, LNCS 2276, Springer Verlag, pp. 197-199), the specific technique of latent semantic indexing (also referred to as latent semantic analysis) has been shown to have potential utility in determining antecedents for pronoun references. An embodiment of the present invention uses conceptual representation spaces to resolve anaphora in the context of a machine translation system.

B. Logical Decomposition

In some languages, such as Arabic, it is not unusual to encounter sentences that are very long in comparison to those typically found in English. Such long sentences present a challenge to machine translation systems. An embodiment of the present invention uses automatic linear text segmentation (an implementation of which is described in commonly-owned U.S. patent application Ser. No. 11/316,837 entitled “Automatic Linear Text Segmentation” to Price, which corresponds to U.S. Published Application No. 2006/0224584, filed Dec. 27, 2005, the entirety of which is incorporated by reference herein) to subdivide lengthy sentences into logically coherent subsets. These subsets can then be translated as individual sentences.

Lengthy sentences may be subdivided in accordance with the following example method. First, all sentences contained in a source-language document (such as an Arabic document) are identified. The sentences may be identified using off-the-shelf software, such as a utility called “java.text.BreakIterator” provided within the Java™ 2 Platform. However, other well-known methods for determining sentence boundaries (such as identifying all words between punctuation marks) can be used. Second, sentences that are longer than a cut-off value are partitioned into smaller blocks of text, each block containing at least one candidate subject, object, and verb. Third, each such block of text is represented in a conceptual representation space (such as an LSI space). Fourth, conceptual similarity scores are computed for adjacent blocks of text based on the representations of the adjacent blocks of text in the conceptual representation space. In an example in which the conceptual representation space is an LSI space, the conceptual similarity score can be a cosine similarity between the vector representations of adjacent blocks of text. Then, similar adjacent blocks of text are aggregated into conceptually cohesive segments based on the similarity scores. The aggregation process continues so long as aggregation criteria are satisfied.
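The fourth step and the aggregation that follows it can be sketched in Python as shown below. The sketch assumes that the block vectors have already been produced by the third step and that the only aggregation criterion is a fixed cosine threshold between neighboring blocks; the actual criteria are not specified here, and the coordinates and names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aggregate_blocks(block_vectors, threshold=0.8):
    """Group adjacent blocks into segments of block indices.

    A block joins the current segment when its cosine similarity to the
    preceding block meets the threshold; otherwise it starts a new segment.
    """
    if not block_vectors:
        return []
    segments = [[0]]
    for i in range(1, len(block_vectors)):
        if cosine(block_vectors[i - 1], block_vectors[i]) >= threshold:
            segments[-1].append(i)
        else:
            segments.append([i])
    return segments


# Hypothetical vectors for four consecutive blocks of one long sentence.
blocks = [(1.0, 0.0), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
segments = aggregate_blocks(blocks)
print(segments)  # [[0, 1], [2, 3]]
```

Each resulting segment would then be passed to the translation step as if it were a sentence of its own.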

After the lengthy sentences are subdivided into conceptually cohesive segments, each conceptually cohesive segment can be automatically translated using methods described herein.
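By way of a non-limiting illustration, the aggregation of adjacent blocks described above might be sketched in Python as follows. The sketch assumes that each block of text has already been mapped to a vector in the conceptual representation space, and it assumes a simple aggregation criterion (cosine similarity of adjacent blocks at or above a fixed threshold). The function names `cosine` and `aggregate_blocks` and the default threshold of 0.7 are hypothetical choices made for this example and are not taken from the described embodiment.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregate_blocks(block_vectors, threshold=0.7):
    """Group adjacent blocks into segments.

    block_vectors: one vector per block of text, in document order.
    Returns a list of segments, each a list of block indices.
    Assumed criterion: a block joins the current segment when its cosine
    similarity to the immediately preceding block is >= threshold.
    """
    if not block_vectors:
        return []
    segments = [[0]]
    for i in range(1, len(block_vectors)):
        if cosine(block_vectors[i - 1], block_vectors[i]) >= threshold:
            segments[-1].append(i)   # conceptually similar: extend the segment
        else:
            segments.append([i])     # dissimilar: start a new segment
    return segments
```

In this sketch, each returned segment would then be handed to the translation method as if it were an individual sentence. Other aggregation criteria (for example, comparing a block against the centroid of the current segment) could be substituted without changing the overall structure.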

C. Data Fusion

A further embodiment of the present invention combines multiple translation algorithms to produce a result of higher quality than any of the individual translations. This embodiment is one example of an application of data fusion methods in natural language processing and exploits the orthogonality among the errors produced by the individual techniques that are combined. Several different approaches exist for combining outputs from multiple translation algorithms (such as voting, weighted voting, application of the Dempster-Shafer theory of evidence combination, etc.). Properties of a conceptual representation space can provide additional possibilities for such combinations.

An embodiment of the present invention provides a method for combining outputs from multiple translation algorithms. The method includes: (i) for a given text passage (typically a sentence), creating multiple translations from the source-language text to the target language using different machine translation algorithms; (ii) generating a vector representation for each of the multiple translations (for example, an LSI vector representation can be generated); and (iii) choosing words and phrases for the output translated text based on comparisons among the individual vector representations. Step (iii) can be performed in several different ways. For example, a vector representation can be calculated for each possible combination of words and phrases suggested by the individual machine translation outputs. The combination of words and phrases that produces a vector representation closest to the average of the vector representations for the individual machine translation outputs can then be chosen.
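The example given for step (iii) might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the candidate combinations are formed from per-position lists of alternative words or phrases, closeness is measured by cosine similarity, and `toy_embed` is a bag-of-words stand-in for a real LSI projection. The names `fuse_translations`, `centroid`, `toy_embed`, and `VOCAB` are hypothetical.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fuse_translations(translations, alternatives, embed):
    """Pick the combination whose vector is closest to the average output vector.

    translations: outputs of the individual machine translation algorithms.
    alternatives: for each position, the words or phrases suggested there.
    embed: callable mapping text to a vector (assumed; e.g. an LSI projection).
    """
    target = centroid([embed(t) for t in translations])
    best_text, best_score = None, float("-inf")
    for combo in itertools.product(*alternatives):
        candidate = " ".join(combo)
        score = cosine(embed(candidate), target)
        if score > best_score:
            best_text, best_score = candidate, score
    return best_text


# Toy stand-in for an LSI projection: term counts over a tiny fixed vocabulary.
VOCAB = ["the", "bank", "shore", "approved", "granted", "loan"]


def toy_embed(text):
    words = text.split()
    return [float(words.count(term)) for term in VOCAB]
```

Because the number of combinations grows multiplicatively with the number of positions that have alternatives, an exhaustive search of this kind is practical only for short passages; a pruned or greedy search could be substituted for longer ones.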

D. Statistical Machine Translation

Two of the primary current approaches to machine translation are example-based machine translation and statistical machine translation. These approaches make use of parallel corpora, from which statistics about a source language and a target language are derived. In accordance with an embodiment of the present invention, statistics for example-based machine translation and/or statistical machine translation approaches are derived based on a distribution of words (and phrases) in a multilingual conceptual representation space. These statistics may be more robust than those generated by existing techniques. For example, current approaches to statistical machine translation typically are variations on a technique described in P. F. Brown, et al., “The Mathematics of Statistical Machine Translation: Parameter Estimation,” 19 Computational Linguistics 263 (1993) (“the IBM paper”). In that technique, estimates of the degree of association of words in the source and target languages are determined based upon the statistics of alignments in translated pairs of sentences. In an embodiment of the present invention, proximity of source and target words in a conceptual representation space provides a more powerful indication of such association. This proximity measurement can be converted to an association probability, and this probability can be directly inserted into models such as those described in the IBM paper.

A method in accordance with an embodiment of the present invention uses source-language and target-language statistics derived from a conceptual representation space in an implementation of example-based machine translation and/or statistical machine translation. The more data (text in the source and target languages) that is taken into consideration in the generation of these statistics, the better.
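The conversion of a proximity measurement into an association probability is not limited to any particular formula. Purely as an illustrative assumption, the following Python sketch normalizes cosine similarities with a softmax so that the candidate target-language terms for a given source-language term receive probabilities that sum to one. The function name `association_probabilities` and the `temperature` parameter are hypothetical and are not drawn from the IBM paper or from the described embodiment.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def association_probabilities(source_vec, target_vecs, temperature=0.1):
    """Map proximities in the representation space to probabilities.

    source_vec: vector of one source-language term.
    target_vecs: dict mapping each candidate target-language term to its vector.
    temperature: assumed sharpness parameter; smaller values concentrate
                 probability on the nearest target terms.
    """
    if not target_vecs:
        return {}
    scores = {term: cosine(source_vec, vec) / temperature
              for term, vec in target_vecs.items()}
    top = max(scores.values())                       # subtract max for stability
    weights = {term: math.exp(s - top) for term, s in scores.items()}
    total = sum(weights.values())
    return {term: w / total for term, w in weights.items()}
```

The resulting values could, for instance, be used to initialize or smooth a word-translation table before alignment-based estimation; whether that improves a given model is an empirical question.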

E. Boot-Strapping To Create A Parallel Corpus Of Documents

A method in accordance with another embodiment of the present invention creates a cross-lingual document space through an iterative process. The method includes the following steps.

In a first step, an initial cross-lingual space is created. The cross-lingual space can be created using known techniques (such as the technique described in U.S. Pat. No. 5,301,109, entitled “Computerized Cross-language Document Retrieval Using Latent Semantic Indexing,” which issued Apr. 5, 1994).

In a second step, a quantity of documents in the languages of the cross-lingual space is collected. It is to be appreciated that more than two languages can be treated in one space at the same time.

In a third step, the collected documents are folded into the cross-lingual space. For example, the documents can be folded into the cross-lingual space according to the folding-in method as described in U.S. Pat. No. 4,839,853, entitled “Computer Information Retrieval Using Latent Semantic Structure,” which issued Jun. 13, 1989.

In a fourth step, the closest pairs (sets) of collected documents in the two (or more) languages are identified. This could be the N closest pairs (sets) or all pairs (sets) closer than a given threshold. Both N and the threshold can be determined heuristically.

In a fifth step, the pairs (sets) of documents identified in the fourth step are treated as additional parallel documents in creating a next iteration of the cross-lingual space. That is, these identified document pairs are treated as additional document pairs (sets) for matrix creation and singular value decomposition (SVD) processing in creating a new iteration of the cross-lingual space as in the first step.

In a sixth step, the fourth and fifth steps are repeated until no pairs (sets) are closer than a threshold (such as an empirically determined threshold). In an alternative implementation of the sixth step, the second through fifth steps are repeated until there are no pairs closer than the threshold.

It is to be appreciated that the above-described method creates a robust cross-lingual conceptual representation space and may be used in conjunction with any of the above-described methods in which a cross-lingual space is employed.
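The control flow of the iterative method above might be sketched in Python as follows. The sketch is illustrative only: the space-building, folding-in, and similarity operations are passed in as callables (`build_space`, `fold_in`, and `similarity` are hypothetical names standing in for, e.g., matrix creation with SVD, the folding-in method, and cosine similarity). Two further assumptions are made for the sake of a terminating example: documents that have been paired are removed from the pool of unpaired documents, and a `max_iterations` cap is imposed.

```python
def bootstrap_space(parallel_pairs, source_docs, target_docs,
                    build_space, fold_in, similarity,
                    threshold, max_iterations=10):
    """Iteratively grow a cross-lingual space from unpaired documents.

    parallel_pairs: initial (source_doc, target_doc) pairs (first step).
    source_docs, target_docs: collected unpaired documents (second step).
    Returns the final space and the full list of document pairs used.
    """
    pairs = list(parallel_pairs)
    source_pool, target_pool = list(source_docs), list(target_docs)
    space = build_space(pairs)                       # first step

    for _ in range(max_iterations):
        # Third step: fold the collected documents into the current space.
        source_vecs = [fold_in(space, d) for d in source_pool]
        target_vecs = [fold_in(space, d) for d in target_pool]

        # Fourth step: for each source document, find the closest unused
        # target document whose similarity is at or above the threshold.
        matches, used_targets = [], set()
        for i, s_vec in enumerate(source_vecs):
            best_j, best_score = None, threshold
            for j, t_vec in enumerate(target_vecs):
                if j in used_targets:
                    continue
                score = similarity(s_vec, t_vec)
                if score >= best_score:
                    best_j, best_score = j, score
            if best_j is not None:
                used_targets.add(best_j)
                matches.append((i, best_j))

        # Sixth step: stop when no pair is closer than the threshold.
        if not matches:
            break

        # Fifth step: treat the matches as additional parallel documents
        # and rebuild the space (assumption: matched documents leave the pool).
        pairs.extend((source_pool[i], target_pool[j]) for i, j in matches)
        used_sources = {i for i, _ in matches}
        source_pool = [d for i, d in enumerate(source_pool) if i not in used_sources]
        target_pool = [d for j, d in enumerate(target_pool) if j not in used_targets]
        space = build_space(pairs)

    return space, pairs
```

Selecting the N closest pairs instead of all pairs above a threshold, or collecting further documents on each pass as in the alternative implementation of the sixth step, would change only the matching and pool-handling portions of the loop.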

Being able to use monolingual data to create auxiliary structures for machine translation potentially makes several orders of magnitude more information available. Typically only thousands to hundreds of thousands of pages of true parallel text are available for most languages. However, there could be millions to hundreds of millions of pages of monolingual text available.

X. Example Computer System Implementation

Various aspects of the present invention can be implemented by software, firmware, hardware, or a combination thereof. FIG. 9 illustrates an example computer system 900 in which an embodiment of the present invention, or portions thereof, can be implemented as computer-readable code. For example, the methods illustrated by flowcharts 500, 600, 700, and 800 of FIGS. 5, 6, 7, and 8, respectively, can be implemented in system 900. Various embodiments of the invention are described in terms of this example computer system 900. After reading this description, it will become apparent to a person skilled in the relevant art how to implement the invention using other computer systems and/or computer architectures.

Computer system 900 includes one or more processors, such as processor 904. Processor 904 can be a special purpose or a general purpose processor. Processor 904 is connected to a communication infrastructure 906 (for example, a bus or network).

Computer system 900 also includes a main memory 908, preferably random access memory (RAM), and may also include a secondary memory 910. Secondary memory 910 may include, for example, a hard disk drive 912 and/or a removable storage drive 914. Removable storage drive 914 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like. The removable storage drive 914 reads from and/or writes to a removable storage unit 918 in a well-known manner. Removable storage unit 918 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 914. As will be appreciated by persons skilled in the relevant art(s), removable storage unit 918 includes a computer usable storage medium having stored therein computer software and/or data.

In alternative implementations, secondary memory 910 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 900. Such means may include, for example, a removable storage unit 922 and an interface 920. Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units 922 and interfaces 920 which allow software and data to be transferred from the removable storage unit 922 to computer system 900.

Computer system 900 may also include a communications interface 924. Communications interface 924 allows software and data to be transferred between computer system 900 and external devices. Communications interface 924 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, or the like. Software and data transferred via communications interface 924 are in the form of signals 928 which may be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 924. These signals 928 are provided to communications interface 924 via a communications path 926. Communications path 926 carries signals 928 and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link or other communications channels.

In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as removable storage unit 918, removable storage unit 922, a hard disk installed in hard disk drive 912, and signals 928. Computer program medium and computer usable medium can also refer to memories, such as main memory 908 and secondary memory 910, which can be memory semiconductors (such as DRAMs, etc.). These computer program products are means for providing software to computer system 900.

Computer programs (also called computer control logic) are stored in main memory 908 and/or secondary memory 910. Computer programs may also be received via communications interface 924. Such computer programs, when executed, enable computer system 900 to implement the present invention as discussed herein. In particular, the computer programs, when executed, enable processor 904 to implement the processes of the present invention, such as the steps in the methods illustrated by flowchart 500 of FIG. 5, flowchart 600 of FIG. 6, flowchart 700 of FIG. 7, and flowchart 800 of FIG. 8 discussed above. Accordingly, such computer programs represent controllers of the computer system 900. Where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 900 using removable storage drive 914, interface 920, hard drive 912 or communications interface 924.

The invention is also directed to computer products comprising software stored on any computer useable medium. Such software, when executed in one or more data processing devices, causes the data processing device(s) to operate as described herein. Embodiments of the invention employ any computer useable or readable medium, known now or in the future. Examples of computer useable mediums include, but are not limited to, primary storage devices (such as any type of random access memory), secondary storage devices (such as hard drives, floppy disks, CD-ROMs, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nanotechnological storage devices, etc.), and communication mediums (such as wired and wireless communications networks, local area networks, wide area networks, intranets, etc.).

XI. Conclusion

It is to be appreciated that the Detailed Description section, and not the Summary and Abstract sections, is intended to be used to interpret the claims. The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventor(s), and thus, are not intended to limit the present invention and the appended claims in any way.

1. A computer-implemented method for translating text, comprising: generating a conceptual representation space based on a plurality of source-language documents and a plurality of target-language documents; generating, in the conceptual representation space, respective representations of a new source-language document and each of a plurality of dictionaries; selecting a first dictionary from the plurality of dictionaries responsive to a similarity between the representation of the new source-language document and the representation of the first dictionary; and translating, by using the first dictionary, a term in the new source-language document into a target-language term.
2. The method of claim 1, wherein generating a representation of each of the plurality of dictionaries comprises: concatenating terms in each of the plurality of dictionaries into a single text object; and generating a representation of the single text object in the conceptual representation space.
3. The method of claim 1, wherein generating a representation of each of the plurality of dictionaries comprises: subdividing each of the plurality of dictionaries into conceptually cohesive segments; generating a text object for each conceptually cohesive segment; and generating a representation of each text object in the conceptual representation space.
4. The method of claim 1, wherein the conceptual representation space is a Latent Semantic Indexing (LSI) space.
5. The method of claim 1, further comprising: determining respective similarities between the representation of the new source-language document and the representation of each of the plurality of dictionaries.
6. The method of claim 5, wherein the similarity between the representation of the new source-language document and the representation of the first dictionary is greater than the other similarities.
7. A computer-program product comprising a computer-readable storage medium having instructions stored thereon that, if executed by a computing device, cause the computing device to perform a method for translating text, the method comprising: generating a conceptual representation space based on a plurality of source-language documents and a plurality of target-language documents; generating, in the conceptual representation space, respective representations of a new source-language document and each of a plurality of dictionaries; selecting a first dictionary from the plurality of dictionaries responsive to a similarity between the representation of the new source-language document and the representation of the first dictionary; and translating, by using the first dictionary, a term in the new source-language document into a target-language term.
8. The computer-program product of claim 7, wherein generating a representation of each of the plurality of dictionaries comprises: concatenating terms in each of the plurality of dictionaries into a single text object; and generating a representation of the single text object in the conceptual representation space.
9. The computer-program product of claim 7, wherein generating a representation of each of the plurality of dictionaries comprises: subdividing each of the plurality of dictionaries into conceptually cohesive segments; generating a text object for each conceptually cohesive segment; and generating a representation of each text object in the conceptual representation space.
10. The computer-program product of claim 7, wherein the conceptual representation space is a Latent Semantic Indexing (LSI) space.
11. The computer-program product of claim 7, wherein the method further comprises: determining respective similarities between the representation of the new source-language document and the representation of each of the plurality of dictionaries.
12. The computer-program product of claim 11, wherein the similarity between the representation of the new source-language document and the representation of the first dictionary is greater than the other similarities.
13. A computing system, comprising: a memory; and a processor coupled to the memory, wherein the processor is configured to execute a method for translating text, the method comprising: generating a conceptual representation space based on a plurality of source-language documents and a plurality of target-language documents; generating, in the conceptual representation space, respective representations of a new source-language document and each of a plurality of dictionaries; selecting a first dictionary from the plurality of dictionaries responsive to a similarity between the representation of the new source-language document and the representation of the first dictionary; and translating, by using the first dictionary, a term in the new source-language document into a target-language term.
14. The computing system of claim 13, wherein generating a representation of each of the plurality of dictionaries comprises: concatenating terms in each of the plurality of dictionaries into a single text object; and generating a representation of the single text object in the conceptual representation space.
15. The computing system of claim 13, wherein generating a representation of each of the plurality of dictionaries comprises: subdividing each of the plurality of dictionaries into conceptually cohesive segments; generating a text object for each conceptually cohesive segment; and generating a representation of each text object in the conceptual representation space.
16. The computing system of claim 13, wherein the conceptual representation space is a Latent Semantic Indexing (LSI) space.
17. The computing system of claim 13, wherein the method further comprises: determining respective similarities between the representation of the new source-language document and the representation of each of the plurality of dictionaries.
18. The computing system of claim 17, wherein the similarity between the representation of the new source-language document and the representation of the first dictionary is greater than the other similarities.